Bayes predictor
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
- Health & Medicine (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Vision (0.68)
Conditional independence testing under misspecified inductive biases
Conditional independence (CI) testing is a fundamental and challenging task in modern statistics and machine learning. Many modern methods for CI testing rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step; we refer to this class of tests as regression-based tests. Although these methods are guaranteed to control Type-I error when the supervised learning methods accurately estimate the regression functions or Bayes predictors of interest, their behavior is less understood when those estimates fail due to misspecified inductive biases, that is, when the employed models are not flexible enough or when the training algorithm does not induce the desired predictors. Motivated by this, we study the performance of regression-based CI tests under misspecified inductive biases. Specifically, we propose new approximations and upper bounds for the testing errors of three regression-based tests, expressed in terms of the misspecification errors. Moreover, we introduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI test that is robust against misspecified inductive biases. Finally, we conduct experiments with artificial and real data, showcasing the usefulness of our theory and methods.
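To make the regression-based paradigm concrete, here is a minimal sketch in the style of a generalised-covariance-measure test: regress X and Y on Z, then test whether the product of the residuals has mean zero. The linear regressions, the data-generating process, and all function names are our own illustrative choices, not the paper's; if the regression models are misspecified, the residuals retain structure and Type-I error control can fail, which is exactly the regime the abstract studies.

```python
import math
import numpy as np

def ols_residuals(target, z):
    # Residuals of an ordinary-least-squares regression of `target` on z
    # (with intercept). A misspecified model here leaves bias in the residuals.
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ beta

def regression_based_ci_test(x, y, z):
    # Test H0: X independent of Y given Z, via the normalized mean of the
    # product of residuals; under H0 (and well-specified regressions) the
    # statistic is asymptotically standard normal.
    r = ols_residuals(x, z) * ols_residuals(y, z)
    t = math.sqrt(len(r)) * r.mean() / r.std()
    return math.erfc(abs(t) / math.sqrt(2))  # two-sided normal p-value

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 1))
x = z[:, 0] + rng.normal(size=1000)
y = z[:, 0] + x + rng.normal(size=1000)   # X and Y are dependent given Z
p_dep = regression_based_ci_test(x, y, z)  # expected to be very small
```

Here the dependence of Y on X survives in the residuals, so the test rejects; in the conditionally independent case (drop the `+ x` term) the residual products are centered and the p-value is typically large.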
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
Wakayama, Tomoya, Suzuki, Taiji
This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
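In symbols, and using our own schematic notation rather than the paper's, the risk decomposition described in the abstract can be written as:

```latex
% Total ICL risk of the trained predictor \hat f, split into
%   (i) Bayes Gap: excess risk over the Bayes-optimal in-context predictor f^*,
%  (ii) Posterior Variance: the irreducible risk of f^* itself.
\mathcal{R}(\hat f)
  \;=\; \underbrace{\mathcal{R}(\hat f) - \mathcal{R}(f^{*})}_{\text{Bayes Gap}}
  \;+\; \underbrace{\mathcal{R}(f^{*})}_{\text{Posterior Variance}}
```

Per the abstract, the first term shrinks with the number of pretraining prompts and their context length, while the second depends only on the difficulty of the true task, with the task-mixture uncertainty vanishing exponentially fast in the number of in-context examples.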
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- (9 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Supplementary materials - NeuMiss networks: differentiable programming for supervised learning with missing values. Appendix A: Proofs
Proof of Lemma 2. Identifying the second- and first-order terms in X, we obtain the stated expression; the last equality concludes the proof. Additionally, assume that either Assumption 2 or Assumption 3 holds; the result then follows from Lemma 1. Here we establish an auxiliary result controlling the convergence of the Neumann iterates to the matrix inverse. Note that Proposition A.1 can easily be extended to the general case by working with M (61), i.e., when a nonlinearity is applied to the activations.
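The auxiliary result mentioned above concerns the convergence of Neumann iterates to a matrix inverse. A minimal numerical sketch of that convergence, with an example matrix of our own choosing rather than one from the paper:

```python
import numpy as np

def neumann_iterate(A, n_iter):
    # Neumann iterates S_{k+1} = I + (I - A) S_k are the partial sums of the
    # Neumann series sum_k (I - A)^k; they converge to A^{-1} whenever the
    # spectral radius of (I - A) is below 1.
    I = np.eye(A.shape[0])
    S = I.copy()
    for _ in range(n_iter):
        S = I + (I - A) @ S
    return S

rng = np.random.default_rng(0)
A = np.eye(4) + 0.05 * rng.normal(size=(4, 4))  # small perturbation of I
A_inv = np.linalg.inv(A)
errors = [np.linalg.norm(neumann_iterate(A, k) - A_inv) for k in (1, 5, 20)]
```

Because the spectral radius of (I - A) is small here, the approximation error decays geometrically with the number of iterates, which is the behavior the appendix's auxiliary result controls.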
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > Canada > Quebec > Montreal (0.14)
- Europe > France (0.04)
- (2 more...)
Review for NeurIPS paper: NeuMiss networks: differentiable programming for supervised learning with missing values.
The paper attacks the classical problem of linear regression with missing values. It computes the Bayes predictor in several missing-value settings and then uses a Neumann series to approximate the Bayes predictor. This approximation is in turn used to design neural networks with ReLU activations. The propositions describing self-masking missingness, which appears to be a novel concept, are interesting but somewhat restrictive because of the linear Gaussian assumptions. Nevertheless, both the results and the methods should be of interest to the NeurIPS 2020 community.